Today we are going to
1) be investigating the results from the alignment and the differential expression analysis that you went through on the previous occassion - do these make sense? do we trust them?
2) learn how to use RMarkdown to more easily generate reproducible results
3) learn how to use ggplot2 to make, and annotate, publication ready plots (PDF)
- PCA (incl gene loadings)
- Volcano plot
4) Use IGV and Gviz to visualise coverage of genes

The idea is that you will be given most of the R code, but unlike the previous day you will be forced to “copy and paste”, as well as think about what the code is doing.

During the course of the day we will go through the exercises in this document, and there will be three types of “exercises”:
Try it yourself: Copy and paste the example code, but make a minor change, in order to test how it works
Question: These are discussion points which require you to think about a point and argue for/against a specific viewpoint
Assignment: These specifies the plots that you should hand in as todays assignment

Data description

(taken from previous day!)

We will continue looking at the data from the previous day, eg YAP and TAZ control peripheral myelination and the expression of laminin receptors in Schwann cells. Poitelon et al. Nature Neurosci. 2016. (http://www.nature.com/neuro/journal/v19/n7/abs/nn.4316.html) In the experiments performed in this study, YAP and TAZ were knocked-down in Schwann cells to study myelination, using the sciatic nerve in mice as a model. The material for RNA-seq was collected from 2 conditions (wt and YAP(kd)TAZ(kd)), each in 3 biological replicates:

Accession Condition Replicate
SRR3222409 KO 1
SRR3222410 KO 2
SRR3222411 KO 3
SRR3222412 WT 1
SRR3222413 WT 2
SRR3222414 WT 3

Reproducible research

Use RMarkdown for reproducible results as a way to connect your data with your “results” (eg plots), and discussion points.

An empty HTML document

Try it yourself 1: Generate a R Markdown and take a look at the example (see below for how to)
  • Open RStudio
  • Make a new R Markdown document (for example by clicking the top left arrow and choosing R Markdown).
  • Name the document something use (eg change the title from Untitled to for example Day 4 practical)
  • Choose the HTML format. Everything you do today will be collected in this document, which means you have have a “log” of all commands necessary to achieve the plots, along with the plots and your comments.
  • Then click on the Knit HTML button on the middle command line above your document, and see what it looks like.
Try it yourself 2: Change the chunk option so that you can see the plot command in the generated html (hint echo)
Try it yourself 3: Add a header above the plot by adding a # at the beginning of the line (see the useful links for other nice formatting options)
Try it yourself 4: Re-knit the HTML file and ensure your two changes are included.

When you have a small simple file like this re-knitting every time is useful, but as you progress to more and more complex, and longer, R Markdown files, you could run your commands directly in RStudio and only re-knit once you want to see how your HTML page looks like. So go to the plot command of your RMarkdown file and press

Command and Enter

you will now see the plot in the bottom right corner of your RStudio

Set up your environment

The first you should do is to ‘setup’ your environment, and this includes to load all required packages, and determine how the output should look like.

In general I find it easier to view “dependencies” by loading all R packages at the top, no matter how far down your document you will use them… Also I tend to add a comment (# to the right) of why I need a specific package, and also the link to the web-page containing the examples on how to use the package. The first time you load a package, it might not exists on your computer, so the below is a good code for testing if the package exists, and if not install it.

Note there are, at least, three major sources, or R packages
* cran
* bioconductor
* github
and they have slighly different installation instructions (for example install.packages vs biocLite) and load (library vs require)

if (!require("RColorBrewer")) {                       # test if the package can be found
  install.packages("RColorBrewer", dependencies = TRUE);  # install the package, if needed
}
library(RColorBrewer);                                # load the package
Try it yourself 5: Load all the packages we will be using today (see below)

The full list of packages:

library(dplyr);                          # summarise your data
library(ggplot2);                        # ggplot
library(knitr);                          # tables with more options on what they will look like... 
library(limma);                          # voom
library(VennDiagram);                    # https://rstudio-pubs-static.s3.amazonaws.com/13301_6641d73cfac741a59c0a851feb99e98b.html
library(grid);                           # required by make3wayVenn
source("https://bioconductor.org/biocLite.R");   # bioconductor is a "collection" of many bioinformatics packages
library(Gviz);
library(GenomicRanges);                  # used by Gviz
library(BSgenome.Hsapiens.UCSC.hg19);    # used by Gviz

Define where RStudio should be looking for additional files

setwd("/Users/halena/Documents/WABI/Teaching/2016_GU_Seqcourse");
Try it yourself 6: Copy the following files from yesterday into a single directory and then set the current working directory to this
  1. the tableCounts file (tableCounts_with_location.tab)
  2. results_DE.txt
  3. the six sorted .bam files
  4. the six .bai files for the above .bam files

(they can also be found on ruby on /home/teacher5/rna-seq-course)

List all the files in the current working directory, there are two commands that means the same thing list.files() and dir.

list.files()            # list ALL files
dir(pattern="*bam")     # list all BAM files

You will see again and again that there are multiple ways of achieving the same results in R, you have to figure out which one makes the most sense to you.

QC your results

Its important to quality control (QC) your results, and this is very easily done in R by “viewing” your data.

Load in the feature counts

tableCounts <- read.table(file="tableCounts", sep="\t", header=TRUE, skip=1);
Question 1: Why do you think we wrote skip=1 in the above command?

Look at the values in the table

head(tableCounts);
##               Geneid Chr   Start     End Strand Length
## 1 ENSMUSG00000102693   1 3073253 3074322      +   1070
## 2 ENSMUSG00000064842   1 3102016 3102125      +    110
## 3 ENSMUSG00000051951   1 3205901 3671498      - 465598
## 4 ENSMUSG00000102851   1 3252757 3253236      +    480
## 5 ENSMUSG00000103377   1 3365731 3368549      -   2819
## 6 ENSMUSG00000104017   1 3375556 3377788      -   2233
##   SRR3222409_Aligned.out.sam SRR3222410_Aligned.out.sam
## 1                          0                          0
## 2                          0                          0
## 3                         41                         31
## 4                          0                          0
## 5                          0                          0
## 6                          0                          0
##   SRR3222411_Aligned.out.sam SRR3222412_Aligned.out.sam
## 1                          0                          0
## 2                          0                          0
## 3                         46                         90
## 4                          0                          0
## 5                          0                          0
## 6                          0                          0
##   SRR3222413_Aligned.out.sam SRR3222414_Aligned.out.sam
## 1                          0                          0
## 2                          0                          0
## 3                         83                         83
## 4                          0                          0
## 5                          0                          0
## 6                          0                          0

If you are unsure about how a specific command is written you can always call help in either R or RStudio

help(read.table)
Question 2: What other options do the read.table function allow us?

In order to look at the data in a better way inside RStudio, although it wont work inside your RMarkdown document

View(tableCounts)

Look at the feature counts

Which gene is the most expressed?

Question 3: why wont max(tableCounts) work?

Different ways of using max to get the answer you need.

max(tableCounts[,c("SRR3222409_Aligned.out.sam", "SRR3222410_Aligned.out.sam", "SRR3222411_Aligned.out.sam", "SRR3222412_Aligned.out.sam", "SRR3222413_Aligned.out.sam", "SRR3222414_Aligned.out.sam")]) # will work but is awkward
## [1] 313845
max(tableCounts[,7:12]) # will work but if you add a column it will break!
## [1] 313845
# for each gene get the max count
maxCountPerGene <- summarise(group_by(tableCounts, Geneid), maxCount=max(SRR3222409_Aligned.out.sam, SRR3222410_Aligned.out.sam, SRR3222411_Aligned.out.sam, SRR3222412_Aligned.out.sam, SRR3222413_Aligned.out.sam, SRR3222414_Aligned.out.sam));
max(maxCountPerGene$maxCount); # get the max count from this directly...
## [1] 313845
# use head and order your data
head(maxCountPerGene[order(maxCountPerGene$maxCount, decreasing=TRUE),], n=1);
## # A tibble: 1 <U+00D7> 2
##               Geneid maxCount
##               <fctr>    <int>
## 1 ENSMUSG00000056569   313845

check that it did what you expected it to

head(maxCountPerGene);
## # A tibble: 6 <U+00D7> 2
##               Geneid maxCount
##               <fctr>    <int>
## 1 ENSMUSG00000000001     1711
## 2 ENSMUSG00000000003        0
## 3 ENSMUSG00000000028      123
## 4 ENSMUSG00000000031    19508
## 5 ENSMUSG00000000037       45
## 6 ENSMUSG00000000049        5

how is the gene expressed across conditions?

subset(tableCounts, Geneid %in% "ENSMUSG00000000001"); # test that the max count is indeed right
##                   Geneid Chr     Start       End Strand Length
## 11835 ENSMUSG00000000001   3 108107280 108146146      -  38867
##       SRR3222409_Aligned.out.sam SRR3222410_Aligned.out.sam
## 11835                        922                        759
##       SRR3222411_Aligned.out.sam SRR3222412_Aligned.out.sam
## 11835                       1037                       1711
##       SRR3222413_Aligned.out.sam SRR3222414_Aligned.out.sam
## 11835                       1557                       1329
Question 4: which way did you prefer?
Question 5: What is this gene? Have a look at the Ensembl database and see if you recognise it

http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000056569;r=1:171150711-171161130

For which sample (control or case) did we see it?

subset(tableCounts, Geneid %in% "ENSMUSG00000056569");
##                  Geneid Chr     Start       End Strand Length
## 2875 ENSMUSG00000056569   1 171150711 171161130      +  10420
##      SRR3222409_Aligned.out.sam SRR3222410_Aligned.out.sam
## 2875                      44051                      58764
##      SRR3222411_Aligned.out.sam SRR3222412_Aligned.out.sam
## 2875                      57719                     249437
##      SRR3222413_Aligned.out.sam SRR3222414_Aligned.out.sam
## 2875                     211064                     313845
Question 6: Is this an outlier? or is this real?

Make a data frame of the counts

This will make it easier to access the alignment data later on

onlyCounts = tableCounts[,7:12];
rownames(onlyCounts) = tableCounts$Geneid;

How many genes have a zero count across all six samples?

# how many genes do we have at all?
nrow(maxCountPerGene);
## [1] 46078
# how many genes have a count of zero?
nrow(subset(maxCountPerGene, maxCount==0))
## [1] 25804
Question 7: Would you include these in your analysis?
Question 8: Where they included in the differential expression analysis?

Summarise the QC

Now you actually have quite a few plots, so in order to find your way around the HTML page, add a table of content by modifying your header section.

Have a look at how to specify your header section here: http://rmarkdown.rstudio.com/html_document_format.html

Once you added the TOC, go ahead and re-knit your document

Did you like the layout of the tableCounts table? No, then lets do something about it

kable(head(tableCounts, row.names=FALSE));
Geneid Chr Start End Strand Length SRR3222409_Aligned.out.sam SRR3222410_Aligned.out.sam SRR3222411_Aligned.out.sam SRR3222412_Aligned.out.sam SRR3222413_Aligned.out.sam SRR3222414_Aligned.out.sam
ENSMUSG00000102693 1 3073253 3074322 + 1070 0 0 0 0 0 0
ENSMUSG00000064842 1 3102016 3102125 + 110 0 0 0 0 0 0
ENSMUSG00000051951 1 3205901 3671498 - 465598 41 31 46 90 83 83
ENSMUSG00000102851 1 3252757 3253236 + 480 0 0 0 0 0 0
ENSMUSG00000103377 1 3365731 3368549 - 2819 0 0 0 0 0 0
ENSMUSG00000104017 1 3375556 3377788 - 2233 0 0 0 0 0 0

Lets rename the counts columns to something more useful.

names(tableCounts)[7:12] <- c("KO_1","KO_2", "KO_3", "WT_1", "WT_2", "WT_3");

Lets look at 10 genes with actual KO expression levels…

kable(head(tableCounts[order(tableCounts$KO_1, decreasing=TRUE),], n=10), row.names=FALSE);
Geneid Chr Start End Strand Length KO_1 KO_2 KO_3 WT_1 WT_2 WT_3
ENSMUSG00000101111 1 24613189 24613971 - 783 54423 30464 44254 49345 49269 58395
ENSMUSG00000100862 1 24613974 24614651 - 678 54071 33384 45354 52416 50684 58233
ENSMUSG00000026043 1 45311538 45349706 + 38169 46209 29860 42177 53756 43728 65661
ENSMUSG00000056569 1 171150711 171161130 + 10420 44051 58764 57719 249437 211064 313845
ENSMUSG00000102070 1 24614885 24615565 - 681 43041 25655 35137 39980 39521 43668
ENSMUSG00000101249 1 24615706 24616197 - 492 41612 24487 33500 38132 38003 46898
ENSMUSG00000001506 11 94936224 94953042 + 16819 39374 31226 37594 48452 40512 45238
ENSMUSG00000029661 6 4504814 4541543 + 36730 37772 26555 34362 41615 36010 43936
ENSMUSG00000018593 11 55394500 55420080 - 25581 35032 26919 33036 45051 40312 46417
ENSMUSG00000037742 9 78478449 78489151 - 10703 29426 22946 27972 41506 33347 33289

Now you have both the results (in the form of plots), along with the commands use to generate the plots, and even, in some ways, the data that was the basis for the plots, in the same document. In order to make your results even more re-producable, start adding comments, like figure legends to your plots.

Volcano plots

In order to make a volcano plot, you need to read in the differential expression results from the previous day, and then you could technically simply plot it, but in general you are interested in visually highlighting the differentially regulated genes.

Read in the differential results

Simply use read.table

diffExp <- read.table("results_DE.txt",sep="\t",header=TRUE);
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : number of items read is not a multiple of the number of columns
nrow(diffExp);
## [1] 6400

Then on the command line outside RStudio check the number of lines in the file against the above count

wc -l results_DE.txt
Question 9: did these return the same number of lines?

The above wont work as there are lots of ’ in the file! this is a general problem with reading files into R, wheras the below will work.

diffExp <- read.table("results_DE.txt", sep="\t", header=TRUE, quote=""); 

Look at the two knock-outs

subset(diffExp, mgi_symbol %in% c("Taz","Yap1"))
##         ensembl_gene_id chromosome_name mgi_symbol
## 1997 ENSMUSG00000053110               9       Yap1
## 9633 ENSMUSG00000009995               X        Taz
##                                                      description
## 1997 yes-associated protein 1 [Source:MGI Symbol;Acc:MGI:103262]
## 9633                 tafazzin [Source:MGI Symbol;Acc:MGI:109626]
##            logFC   logCPM        LR     PValue        FDR
## 1997 -0.29413201 6.891383 6.0503383 0.01390363 0.09743683
## 9633 -0.09640493 4.893313 0.3092547 0.57813787 0.83972176
Question 10: Was the knock-out experiment successful?

Highlight significant genes

diffExp$threshold = as.factor(diffExp$FDR<=0.01 & abs(diffExp$logFC)>=2)

Construct the plot object

g <- ggplot(data=diffExp, aes(x=logFC, y=-log10(PValue), colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  theme(legend.position = "none") +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab("log2 fold change") + ylab("-log10 p-value")
print(g);
## Warning: Removed 82 rows containing missing values (geom_point).

Question 11: what do you think the error message from the above command means? Is this something we have to worry about?

Add the mouse symbols

Here we will add mouse symbols on top of the circles for the significantly expressed genes only, as adding it for all 13.000 genes will become unreadable.

genesSignificant <- diffExp[diffExp$threshold==TRUE,]; # first pull out the significant genes
g + geom_text(data=genesSignificant, aes(label= mgi_symbol), size=1.2, colour="black");
## Warning: Removed 82 rows containing missing values (geom_point).
## Warning: Removed 24 rows containing missing values (geom_text).

The above works as genesSignificant looks exactly the same (names of columns) as diffExp, which is because its a subset of the rows in it.

Question 12: is Taz and Yap among the significant genes?
Try it yourself 7: Change the above plot so that the text size is much larger
Assignment 1: Subset the volcano plot to only include -log10 pvalues above 5, how many of the significant genes are included in this plot?
Question 13: Why do you think we subset the significant genes to only those with an abs(logFC)>2?

Venn diagrams

Make a 3-way Venn diagram

make3wayVenn <- function(A, B, C, categories){
  AB=intersect(x=A, y=B);
  AC=intersect(x=A, y=C);
  BC=intersect(x=B, y=C);
  overlap = intersect(x=AB, y=C);
  
  grid.newpage();
  v <- draw.triple.venn(area1=length(A),area2=length(B),area3=length(C),
                        n12=length(AB), n13=length(AC), n23=length(BC), 
                        n123=length(overlap),
                        category = categories, scaled=FALSE, 
                        print.mode=c('raw','percent'), sigdigs=0,
                        lty = 'blank', 
                        fill = c('skyblue', 'pink1', 'mediumorchid'));
  grid.draw(v);
}
Question 14: What do you think the above code does?
Question 15: Do you like it?
expressedInKO1 <- subset(tableCounts, KO_1>10);
expressedInKO2 <- subset(tableCounts, KO_2>10);
expressedInKO3 <- subset(tableCounts, KO_3>10);
make3wayVenn(expressedInKO1$Geneid, expressedInKO2$Geneid, expressedInKO3$Geneid,
             categories=c("KO_1","KO_2","KO_3"));

Question 16: Is the overlap in expressed genes good?
Question 17: Where would you like to set the cutoff for when a gene is expressed?
Assignment 2: Make a 3-way venn diagram for the control genes

More Venn Diagrams

Try it yourself 8: type help(“draw”) to see a list of all the options from the VennDiagram package
Assignment 3: Make a 2-way venn diagram for the genes expressed in control vs the genes expressed in KO using the diffExp data frame

Other Venn packages

In my opinion Venn diagrams is one of the plots you are asked for again and again that R does not provide a really good answer for. If you Google for venn r you will get a long list of possible packages. I like the one we used here because it gives me the possibility to easily set the layout of it, and to show the percentage as well.

PCA plots

Take the normalised data (here we use log2), decide the annotations you want to add to the plot

pseudoCount = log2(onlyCounts + 1);
annotations <- data.frame(Condition=c("KO","KO", "KO", "WT", "WT", "WT"), 
                          Names=names(pseudoCount));

Actually run the PCA analsysis

pseudoCount.t <- t(pseudoCount);      # transpose the data
pcaObject <- prcomp(pseudoCount.t);   # perform the PCA analysis

Add the annotations to the pca results, calculate how much each principal component contributes

X <- merge(x = pcaObject$x, y = annotations, 
           by.x="row.names", by.y = "Names", all.x=TRUE);
percentage <- round(pcaObject$sdev^2 / sum(pcaObject$sdev^2) * 100, 2);
df <- as.data.frame(pcaObject$x);
percentageLabs <- paste( colnames(df), "-", paste( as.character(percentage), "%", sep="") );

Show the results as an annotated plot

ggplot(X, aes_string(x="PC1", y="PC2", color="Condition"))+
  geom_point() +
  labs(title = "PCA") +
  xlab(percentageLabs[1]) + ylab(percentageLabs[2]) +
  geom_hline(yintercept=0) + geom_vline(xintercept=0);

Question 18: do you think the KO and the WT are separated from each other?
Question 19: is the percentage of variance explained by PC1 and PC2 high or low?

Make a gene loading plot

showGeneLoadings <- function(pcaObject, N=30, numberOfPCToShow=4){
  gl <- as.data.frame(pcaObject$rotation);
  gl <- merge(x=gl, y=diffExp[,c("ensembl_gene_id","mgi_symbol")], by.x="row.names", by.y="ensembl_gene_id", all.x=TRUE);
  geneLoadings = array(data=0, dim=numberOfPCToShow);
  for(g in 1:numberOfPCToShow){
    p=paste("PC",g,sep="");
    if(p %in% colnames(gl)){
      gl.cur <- gl[, c("Row.names", "mgi_symbol", p)];
      names(gl.cur) = c("Ensembl", "Symbol", "Loading");
      gl.pc <- head(gl.cur[order(abs(gl.cur$Loading), decreasing=TRUE),], n=N); # pick the top N gene loadings
      sumInfluence=round(sum(t(abs(gl.pc$Loading))), digits=4);
      gl.pc$Symbol <- reorder(gl.pc$Symbol, gl.pc$Loading); # needed in order to make ggplot order the data in the plot
      gPlot <- ggplot(gl.pc, aes(x=Loading, y=factor(Symbol)))+
        geom_point(color="Blue", aes(order=Loading))+
        labs(x="Loading",y="Gene",title=sprintf("Top %i genes in %s", N, p));
      print(gPlot);
    }
  }
}
showGeneLoadings(pcaObject=pcaObject)

Try it yourself 9: Show only the top 10 gene loadings for the 6 first principle components
Try it yourself 10: what would happen if you asked for the first 10 principal components?
Question 20: how many principal components are there?
Question 21: can you find Taz and Yap among the top gene loading genes? (does this make sense)
Question 22: how does the top gene loading genes correspond to the differentially expressed genes?
Question 23: PC2 seem to consist of genes mostly from a single family, which one and what does it do?

Visualise coverage of chromosomal neighborhood

in IGV (standalone tool)

What we need:
1) IGV - the actual program
2) genome file (.fa) - “comes with” IGV
3) sorted .bam file of interest(s)
4) index of the sorted .bam file of interest(s), eg the .bai files

Genome file

Inside IGV:

Genomes  
Load genomes from server  
Select 'Mouse mm10'  

Load a .bam file

Inside IGV:

File  
Load from file  
Select one of the .bam files - note that the .bam and the .bam.bai have to be located in the same folder

Do the above for all six .bam files

View a gene

Inside IGV:

In the textbox (top middle), type Taz and then click on the Go button  
Question 24: Is Taz differentially expressed in your opinion?
Try it yourself 11: View one of the top genes found in the gene loadings of the principal components analysis, is this one differentially expressed?

Via Gviz in R

https://bioconductor.org/packages/release/bioc/html/Gviz.html
https://bioconductor.org/packages/release/bioc/vignettes/Gviz/inst/doc/Gviz.pdf

Example from the vignette

Plot CpG islands on chromosome 7 from the human genome hg19 - for a full descriptions of what the commands means, browse through the vignette on your own

data(cpgIslands);
chr <- as.character(unique(seqnames(cpgIslands)))
gen <- genome(cpgIslands);
atrack <- AnnotationTrack(cpgIslands, name = "CpG");
plotTracks(atrack);

gtrack <- GenomeAxisTrack();
plotTracks(list(gtrack, atrack));

itrack <- IdeogramTrack(genome = gen, chromosome = chr);
plotTracks(list(itrack, gtrack, atrack));

Add gene information

data(geneModels);
grtrack <- GeneRegionTrack(geneModels, genome = gen, chromosome = chr, name = "Gene Model")
plotTracks(list(itrack, gtrack, atrack, grtrack));

Try it yourself 12: make the chromosome at the bottom instead of at the top
plotTracks(list(itrack, gtrack, atrack, grtrack), from = 26700000, to = 26750000);

plotTracks(list(itrack, gtrack, atrack, grtrack), extend.left = 0.5, extend.right = 1000000);

Question 25: come up with an example in which it would be nice to visualise the up-stream region.

Show the bases

strack <- SequenceTrack(Hsapiens, chromosome = chr);
plotTracks(list(itrack, gtrack, atrack, grtrack, strack), 
           from = 26591822, to = 26591852, cex = 0.8);

Add gene symbols

grtrack <- GeneRegionTrack(geneModels, genome = gen, 
                           chromosome = chr, name = "Gene Model", transcriptAnnotation = "symbol",
                           background.title = "brown");
plotTracks(list(itrack, gtrack, atrack, grtrack));

Try it yourself 13: change the background color of the CpG track to yellow
Assignment 4: make a Gviz plot of the mouse genome highlighting your favorite gene, which tracks did you choose to include and why?

Show aligned reads

Question 26: what types of data tracks can you visualise (hint page 19-20 in the vignette)

Coverage (from bam)

Find the list of all BSgenomes - download the right one
- http://bioconductor.org/packages/release/bioc/html/BSgenome.html

Find the mouse package name
install the package for mouse

Load the .bam file

dTrack4 <- DataTrack(range = "SRR3222411_Aligned.out.bam.sorted.bam", genome = "mm10",
                     type = "h", name = "Coverage", window = NULL, 
                     options(ucscChromosomeNames=FALSE));

The ucscChromosomeNames=FALSE is important if the chromosome names in the .bam file does not contain the chr part, to get a complete list of chromosome names use samtools from the command line

samtools view SRR3222411_Aligned.out.bam.sorted.bam | cut -f 3 | sort | uniq -c
Coverage for Taz
plotTracks(dTrack4, from = 74280718, to = 74292151, chromosome="X");

Question 27: do the two representations agree?

Display parameters

Have a look at the settings sections of the Gviz manual (https://bioconductor.org/packages/devel/bioc/manuals/Gviz/man/Gviz.pdf)

gtrack <- GenomeAxisTrack(col="#FF00FF", cex=1.2);
itrack <- IdeogramTrack(genome = gen, chromosome = chr, fontcolor="red");
plotTracks(list(itrack, gtrack, atrack, grtrack));

Try it yourself 14: make the font for the GenomeAxisTrack larger and make the lines magenta, change the color of the chromosome name for the IdeogramTrack
Re-plot the tracks to see that the above changes have taken effect
Try it yourself 15: Change another display option for one of the other visible tracks

To see the available display parameters

availableDisplayPars("GenomeAxisTrack");
availableDisplayPars("IdeogramTrack");

More Gviz examples http://www.sthda.com/english/wiki/print.php?id=43

IGV vs GViz

Question 28: which representation (IGV or Gviz) did you prefer? and why?
Assignment 5: include the 6 .bam files (control and KO) as grouped line plots (page 21) to the previous assignment

Extra Assignments For The Curious

MA plots

Microarray

An MA plot is an application of a Bland-Altman plot for visual representation of two channel DNA microarray gene expression data which has been transformed onto the M (log ratios) and A (mean average) scale. (from wikipedia)

RNASeq

For weakly expressed genes, we have no chance of seeing differential expression, because the low read counts suffer from such high Poisson noise that any biological effect is drowned in the uncertainties from the sampling at a low rate. (http://www.bioconductor.org/help/workflows/rnaseqGene/)

An MA-plot is a plot of log-fold change (M-values, i.e. the log of the ratio of level counts for each gene between two samples) against the log-average (A-values, i.e. the average level counts for each gene across the two samples). The MA-plot is a useful to visualize reproducibility between samples of an experiment. From a MA-plot one can see if normalization is needed. In MA plot, genes with similar expression levels in two samples will appear around the horizontal line y = 0 (blue). A lowess t (in red) is plotted underlying a possible trend in the bias related to the mean expression. (http://www.nathalievilla.org/doc/pdf/tutorial-rnaseq.pdf)

Make a MA plot

makeMAPlot <- function(dat, conditionColumns, caseColumns){
  x = rowMeans(as.matrix(dat[,conditionColumns]));
  y = rowMeans(as.matrix(dat[,caseColumns]));
  M = x - y;     ## M-values
  A = (x + y)/2; ## A-values
  df = data.frame(A, M);
  ggplot(df, aes(x = A, y = M)) + 
    geom_point(size = 1.5, alpha = 1/5) +
    geom_hline(color = 'blue3', yintercept=0) + 
    stat_smooth(se = FALSE, method = 'loess', color = 'red3');
}
makeMAPlot(pseudoCount, 1:3, 4:6);

Try it yourself 1: what would it look like if we didn’t log-transform the data?
Question 1: do you think the data needs to be normalised?
v <- voom(onlyCounts);
makeMAPlot(v, 1:3, 4:6);

Question 2: what did this change? was it for the better?

Is there a relation between read count and gene length?

maxCountPerGeneAndLength <- summarise(group_by(tableCounts, Geneid, Length), maxCount=max(KO_1,KO_2, KO_3, WT_1, WT_2, WT_3));
ggplot(maxCountPerGeneAndLength, aes(x=Length, y=maxCount))+
  geom_point();

The above is not entirely useful, so log-transform your values

ggplot(maxCountPerGeneAndLength, aes(x=Length, y=log2(maxCount)))+
  geom_point();

Question 3: Is it “ok” to only log transform one axis, or should you do both?
ggplot(maxCountPerGeneAndLength, aes(x=log2(Length), y=log2(maxCount)))+
  geom_point();

Question 4: Did that change your answer?

Generate a PDF file of a plot (or all plots from a R session)

From inside RStudio

In the bottom right plot panel

Export
PDF

From ‘normal’ R

pdf(file="volcanoPlot.pdf");
g <- ggplot(data=diffExp, aes(x=logFC, y=-log10(PValue), colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  theme(legend.position = "none") +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab("log2 fold change") + ylab("-log10 p-value")
print(g);
dev.off() # not a good idea to run from inside your RMarkdown sesssion!

Note that if you run the above code (eg the pdf and dev commands) you will no longer see the plot from inside RStudio!

Note you can easily make a multipage PDF file containing all plots by calling pdf() at the start of your R session and dev.off() at the end…

From RMarkdown

It is possible to “tell” RMarkdown to generate a “picture” of each plot

knitr::opts_chunk$set(fig.path = 'plots/', dev = 'pdf'); # save all plots in the plots folder so I can access them from powerpoint as well..

The above will create a folder called plots and then make a .pdf file for each.

Of course these type of commands should be part of the ‘setup’ section at the top of your RMarkdown document.

Try it yourself 2: Add the command to the top of your document, re-knit it, look at the “progress” at the bottom of your RStudio
Question 5: Look at the list of generated pdf files, did you get as many as you expected? Why/Why not? Was the names useful?

Name your code chunks

Instead when creating a plot, try naming the code chunk eg to call the code chunk VolcanoPlot simply add it at the top of the current code chunk (after the r add a space followed by the name of the chunk).

ggplot(data=diffExp, aes(x=logFC, y=-log10(PValue), colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  theme(legend.position = "none") +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab("log2 fold change") + ylab("-log10 p-value")
## Warning: Removed 82 rows containing missing values (geom_point).

now this plot will be called VolcanoPlot-1.pdf in the plots folder, and when you knit your document these names are shown at the bottom of your RStudio